{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "1e71c381-b455-4b05-897c-8806ca64c3f9",
   "metadata": {},
   "source": [
    "# Predicting Ham vs Spam emails"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "5de34d28-d8cf-4296-b54d-e81d203c7d55",
   "metadata": {},
   "outputs": [],
   "source": [
    "%run Coding_naive_bayes.ipynb # allows us to use the code we wrote\n",
    "import pandas\n",
    "import pathlib"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2963eeca-0648-4568-9476-6c2f9c3a38c0",
   "metadata": {},
   "source": [
    "### 1. Imports and pre-processing data\n",
    "\n",
    "We load the data into a Pandas dataframe, then we preprocess it by adding a column with the (non-repeated) lowercase words in the email."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "21e07ded-1173-4d02-89fc-d7cfcd4eba87",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>text</th>\n",
       "      <th>spam</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Subject: naturally irresistible your corporate...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Subject: the stock trading gunslinger  fanny i...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Subject: unbelievable new homes made easy  im ...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Subject: 4 color printing special  request add...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Subject: do not have money , get software cds ...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Subject: great nnews  hello , welcome to medzo...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Subject: here ' s a hot play in motion  homela...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Subject: save your money buy getting this thin...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Subject: undeliverable : home based business f...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Subject: save your money buy getting this thin...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                text  spam\n",
       "0  Subject: naturally irresistible your corporate...     1\n",
       "1  Subject: the stock trading gunslinger  fanny i...     1\n",
       "2  Subject: unbelievable new homes made easy  im ...     1\n",
       "3  Subject: 4 color printing special  request add...     1\n",
       "4  Subject: do not have money , get software cds ...     1\n",
       "5  Subject: great nnews  hello , welcome to medzo...     1\n",
       "6  Subject: here ' s a hot play in motion  homela...     1\n",
       "7  Subject: save your money buy getting this thin...     1\n",
       "8  Subject: undeliverable : home based business f...     1\n",
       "9  Subject: save your money buy getting this thin...     1"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Environment variables\n",
    "dir_path = pathlib.Path.cwd()\n",
    "name_dataset = \"emails.csv\"\n",
    "\n",
    "column_emails = \"text\"\n",
    "column_words = \"words\"\n",
    "column_label = \"spam\"\n",
    "\n",
    "# Read dataset\n",
    "emails = pandas.read_csv(dir_path.parents[0] / name_dataset)\n",
    "emails[:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "5b6d0bca-7c6a-4bd3-83fc-21d9a2d66d50",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Helpers (Preprocess) =========================================\n",
    "\n",
    "def split_string_into_unique_words(string):\n",
    "    return list(set(string.split()))\n",
    "\n",
    "def process_series_email(series_text):\n",
    "    \"\"\" Converts text to lower-case then returns list of unique words \"\"\"\n",
    "    series_words = series_text.copy() # copies original series\n",
    "    series_words = series_words.str.lower()\n",
    "    series_words = series_words.apply(split_string_into_unique_words)\n",
    "\n",
    "    return series_words"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "066b19a7-563f-42b1-84ab-c51b3dad57db",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>text</th>\n",
       "      <th>spam</th>\n",
       "      <th>words</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Subject: naturally irresistible your corporate...</td>\n",
       "      <td>1</td>\n",
       "      <td>[provided, in, gaps, change, aim, changes, int...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Subject: the stock trading gunslinger  fanny i...</td>\n",
       "      <td>1</td>\n",
       "      <td>[kansas, ramble, segovia, herald, libretto, ea...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Subject: unbelievable new homes made easy  im ...</td>\n",
       "      <td>1</td>\n",
       "      <td>[of, in, time, credit, complete, 1, advantage,...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Subject: 4 color printing special  request add...</td>\n",
       "      <td>1</td>\n",
       "      <td>[of, now, goldengraphix, canyon, version, our,...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Subject: do not have money , get software cds ...</td>\n",
       "      <td>1</td>\n",
       "      <td>[old, ', d, t, cds, be, finish, along, great, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Subject: great nnews  hello , welcome to medzo...</td>\n",
       "      <td>1</td>\n",
       "      <td>[of, 5, in, miilion, um, customers, pleased, a...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Subject: here ' s a hot play in motion  homela...</td>\n",
       "      <td>1</td>\n",
       "      <td>[ensuring, toois, predictions, states, shouid,...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Subject: save your money buy getting this thin...</td>\n",
       "      <td>1</td>\n",
       "      <td>[of, now, in, provided, right, tried, cialis, ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Subject: undeliverable : home based business f...</td>\n",
       "      <td>1</td>\n",
       "      <td>[on, ptt, mon, recipient, telecom, unknown, 20...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Subject: save your money buy getting this thin...</td>\n",
       "      <td>1</td>\n",
       "      <td>[of, aicohoi, in, provided, right, now, tried,...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                text  spam  \\\n",
       "0  Subject: naturally irresistible your corporate...     1   \n",
       "1  Subject: the stock trading gunslinger  fanny i...     1   \n",
       "2  Subject: unbelievable new homes made easy  im ...     1   \n",
       "3  Subject: 4 color printing special  request add...     1   \n",
       "4  Subject: do not have money , get software cds ...     1   \n",
       "5  Subject: great nnews  hello , welcome to medzo...     1   \n",
       "6  Subject: here ' s a hot play in motion  homela...     1   \n",
       "7  Subject: save your money buy getting this thin...     1   \n",
       "8  Subject: undeliverable : home based business f...     1   \n",
       "9  Subject: save your money buy getting this thin...     1   \n",
       "\n",
       "                                               words  \n",
       "0  [provided, in, gaps, change, aim, changes, int...  \n",
       "1  [kansas, ramble, segovia, herald, libretto, ea...  \n",
       "2  [of, in, time, credit, complete, 1, advantage,...  \n",
       "3  [of, now, goldengraphix, canyon, version, our,...  \n",
       "4  [old, ', d, t, cds, be, finish, along, great, ...  \n",
       "5  [of, 5, in, miilion, um, customers, pleased, a...  \n",
       "6  [ensuring, toois, predictions, states, shouid,...  \n",
       "7  [of, now, in, provided, right, tried, cialis, ...  \n",
       "8  [on, ptt, mon, recipient, telecom, unknown, 20...  \n",
       "9  [of, aicohoi, in, provided, right, now, tried,...  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "emails[column_words] = process_series_email(emails[column_emails])\n",
    "emails[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eb7cc9d5-d7db-4dd4-b0d6-86a0b8335982",
   "metadata": {},
   "source": [
    "### 2. Calculate the priors\n",
    "\n",
    "Our label column is boolean, with spam being 1 and ham being 0. Let's calculate the probabilities of seeing a ham or spam email from just the labeled data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "7f092420-ab57-4942-85da-124eef93e6b0",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0    4360\n",
      "1    1368\n",
      "Name: spam, dtype: int64\n",
      "Number of emails: 5728\n",
      "Number of spam emails: 1368\n",
      "Probability of spam: 0.2388268156424581\n"
     ]
    }
   ],
   "source": [
    "label_spam = 1\n",
    "label_ham = 0\n",
    "\n",
    "# meta data\n",
    "num_emails = len(emails)\n",
    "counts_label = emails[column_label].value_counts()\n",
    "num_spam = counts_label[label_spam]\n",
    "print(counts_label)\n",
    "\n",
    "print(\"Number of emails:\", num_emails)\n",
    "print(\"Number of spam emails:\", num_spam)\n",
    "\n",
    "# Calculating the prior probability an email is spam.\n",
    "dict_priors = calculate_frequency_average(emails[column_label])\n",
    "print(\"Probability of spam:\", dict_priors[label_spam])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f42981f3-b6c4-4c4f-bccd-9ac0276204c4",
   "metadata": {},
   "source": [
    "### 3. Training the model\n",
    "\n",
    "We'll calculate word frequencies based on the whole text, then train the model by calculating the frequencies for each word in each label."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "0bfeb7a3-074c-4b38-b3cb-77d71494c6cf",
   "metadata": {},
   "outputs": [],
   "source": [
    "dict_frequencies_whole_text = construct_frequency_dict_from_series(\n",
    "    emails[column_emails])\n",
    "dict_model = calculate_labeled_frequencies(\n",
    "    dict_frequencies_whole_text, emails, column_label, column_words)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "7bdedeee-dc30-455d-ba00-307efa44169f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{1: 8.000010682682143, 0: 1.0682682143736557e-05}\n",
      "{1: 38.00005661821536, 0: 41.00005661821536}\n",
      "{1: 64.0002382238118, 0: 317.0002382238118}\n"
     ]
    }
   ],
   "source": [
    "# Some examples (1 is spam, and 0 is ham)\n",
    "print(dict_model['lottery'])\n",
    "print(dict_model['sale'])\n",
    "print(dict_model['already'])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "11813c8e-a909-4b78-9ad1-44d4de7a62e2",
   "metadata": {},
   "source": [
    "### 4. Using the model to make predictions\n",
    "\n",
    "We can see the probability a word is associated with spam given our data. We can also add new words to our model by calculating their word frequencies."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "ca50db6f-3f04-43db-8c6e-374da021f3fa",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.9999986646682983\n",
      "0.4810126854437434\n",
      "0.1679794178226178\n"
     ]
    }
   ],
   "source": [
    "print(predict_bayes('lottery', label_spam, dict_model))\n",
    "print(predict_bayes('sale', label_spam, dict_model))\n",
    "print(predict_bayes('already', label_spam, dict_model))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "d7dc157f-f789-41c7-b381-33a9aa212734",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Probability email is spam:  0.9999999005387217\n",
      "Probability email is spam:  0.09520065375112553\n",
      "Probability email is spam:  0.25112821625045023\n",
      "Probability email is spam:  3.4107286996145865e-11\n",
      "Probability email is spam:  0.9999999987178892\n",
      "Probability email is spam:  0.9999999999343989\n",
      "Probability email is spam:  0.9999999996071451\n",
      "Probability email is spam:  0.5000000000000001\n"
     ]
    }
   ],
   "source": [
    "list_emails = [\n",
    "    \"lottery sale\",\n",
    "    \"Hi mom how are you\",\n",
    "    \"Hi MOM how aRe yoU afdjsaklfsdhgjasdhfjklsd\",\n",
    "    \"meet me at the lobby of the hotel at nine am\",\n",
    "    \"enter the lottery to win three million dollars\",\n",
    "    \"buy cheap lottery easy money now\",\n",
    "    \"buy cheap lottery easy money\"\n",
    "    \"Grokking Machine Learning by Luis Serrano\",\n",
    "    \"asdfgh\"]\n",
    "\n",
    "# Adding new words to our dictionary\n",
    "dict_frequencies_new_words = construct_frequency_dict_from_strings(list_emails)\n",
    "dict_model = setup_naive_bayes(\n",
    "    dict_frequencies_new_words, emails, column_label, column_words, column_emails)\n",
    "cout = \"Probability email is spam: \"\n",
    "for email in list_emails:\n",
    "    print(cout, predict_naive_bayes(email, label_spam, dict_model, emails[column_label]))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ef365b9f-44ce-43f0-b17b-246212101525",
   "metadata": {},
   "source": [
    "### 5. Do our results make sense?\n",
    "The \"Grokking Machine Learning by Luis Serrano\" classification was surprising. Or was it? Let's check how often a word like \"serrano\" appears in spam emails."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "7f368d61-ade7-4a61-bcae-e54881d9df86",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{1: 1.0000005876034477, 0: 5.876034475869477e-07}\n",
      "0.9999994123972429\n"
     ]
    }
   ],
   "source": [
    "print(dict_model['serrano'])\n",
    "print(predict_bayes('serrano', label_spam, dict_model))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8f07dbc1-273c-47fd-9270-e6dddc2f671f",
   "metadata": {},
   "source": [
    "Hmm, that seeems pretty high. But, if we look closer at the training data, the following email was labaled spam and has \"serrano\"!\n",
    "\n",
    "> Subject: important announcement : your application was approved  we tried to contact you last week about refinancing your home at a lower rate .  i would like to inform you know that you have been pre - approved .  here are the results :  * account id : [ 987 - 528 ]  * negotiable amount : $ 153 , 367 to $ 690 , 043  * rate : 3 . 70 % - 5 . 68 %  please fill out this quick form and we will have a broker contact you as soon as possible .  regards ,  shannon **serrano** senior account manager  lyell national lenders , llc .  database deletion :  www . lend - bloxz . com / r . php\n",
    "\n",
    "Talk about bad luck. This highlights the importance of cleaning the data before you train!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7bcd1eb3-b4b0-424a-923e-29c22e221bed",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
